Spatial Ecology and Macroecology

Practical - Week 1

2024-09-30

What are we going to see today?

  1. Data types
  2. Data sources
  3. Open data
  4. Exercise 1: explore data sources
  5. Exercise 2: data download and cleaning in R (quality check)

1. Data types

Data that can place a particular taxon in a particular location and time can take many forms.

Presence-only (PO) data

  • For example: records from museum and herbarium collections or citizen-science initiatives
  • Characteristics: usually opportunistic, single species, spatio-temporally specific, absences are unknown

PROS: huge amounts of data available, easily aggregated
CONS: often without details of effort/method, wide variation in data quality

Presence-absence data

  • For example: data from inventories, checklists, atlas, acoustic sensors, DNA sampling or camera-trap surveys
  • Characteristics: multiple species, spatio-temporally specific, report searches that did not find the species (absences)

PROS: absences are informative, area and effort are measured
CONS: less abundant (too time consuming), methods are species-specific
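
To make the difference between presence-only and presence-absence data concrete, here is a minimal sketch in base R. It assumes a hypothetical survey in which three sites (A, B, C) were all searched for the same species; the site names and detections are invented for illustration. Because we know which sites were searched, a site without a detection becomes a recorded absence (0) rather than an unknown.

```r
# Hypothetical survey: every site in `all_sites` was searched (assumption).
all_sites <- c("A", "B", "C")

# Detections only: this is what a presence-only dataset would look like.
detections <- data.frame(
  site    = c("A", "A", "B"),
  species = c("Castor fiber", "Lutra lutra", "Castor fiber")
)

# Cross-tabulate sites against species, keeping sites with no detections.
tab <- table(factor(detections$site, levels = all_sites), detections$species)

# Convert counts to a 0/1 presence-absence matrix.
pa <- (unclass(tab) > 0) * 1L
pa
```

Site C appears as a row of zeros, which is exactly the information a presence-only dataset cannot give us.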

Repeated surveys

  • For example: monitoring schemes, repeated atlas projects
  • Characteristics: multiple species, over time, spatially defined, use a standardized protocol

PROS: standardised protocols, multiple points in time
CONS: expensive; geographically restricted, and usually temporally too

Range-maps

  • For example: outlines of species distributions, IUCN ranges, field guides
  • Characteristics: single species, expert-drawn

PROS: rough estimates of the outer boundaries of areas within which species are likely to occur
CONS: large spatial and temporal uncertainties

Data can also be classified by how they were collected.

Structured

  • clear survey design (location, target) and standardised sampling protocol
  • site selection: preselected locations, sometimes stratified random
  • metadata: informs about the survey methods

Semi-structured

  • no survey design, but some (minimal) standardised sampling protocol
  • site selection: free
  • metadata: informs about the observation process and survey methods

Unstructured (opportunistic)

  • no survey design and no standardised sampling protocol
  • site selection: free
  • metadata: almost none

Finally, data can also be classified by how they are made available to others.

Disaggregated (König et al. 2021)

  • precision is high, but completeness and representativeness are low.

Aggregated (König et al. 2021)

  • precision is low, but completeness and representativeness are high.
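
One way to picture the step from disaggregated to aggregated data is to bin point records into grid cells and summarise each cell. The sketch below is only an illustration in base R: the coordinates are invented, and the 1-degree cell size is an arbitrary assumption, not a recommendation.

```r
# Invented point records (disaggregated data): one row per observation.
records <- data.frame(
  species = c("Myocastor coypus", "Castor fiber", "Sciurus vulgaris",
              "Microtus arvalis", "Myocastor coypus"),
  lon = c(14.41, 14.44, 13.36, 16.31, 14.40),
  lat = c(50.08, 50.09, 49.76, 50.02, 50.08)
)

# Assign each record to the lower-left corner of its grid cell.
cell_size <- 1 # degrees (assumption)
records$cell_lon <- floor(records$lon / cell_size) * cell_size
records$cell_lat <- floor(records$lat / cell_size) * cell_size

# Aggregate: number of distinct species per cell.
richness <- aggregate(
  species ~ cell_lon + cell_lat,
  data = records,
  FUN = function(x) length(unique(x))
)
richness
```

Each cell now has a species count, but the exact position of every record inside the cell is lost: precision goes down as the summary per unit area becomes more complete.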

2. Data sources

gbif.org

GBIF is an international network and data infrastructure funded by the world’s governments and aimed at providing anyone, anywhere, open access to data about all types of life on Earth.

rgbif: https://github.com/ropensci/rgbif

obis.org

OBIS is a global open-access data and information clearing-house on marine biodiversity for science, conservation and sustainable development.

robis: https://github.com/iobis/robis

ebird.org

eBird’s goal is to gather birdwatchers’ knowledge and experience in the form of checklists of birds, archive it, and freely share it to power new data-driven approaches to science, conservation and education.

auk: https://cornelllabofornithology.github.io/auk/

rebird: https://github.com/ropensci/rebird

inaturalist.org

iNaturalist is one of the world’s most popular nature apps. It allows participants to contribute observations of any organism, or traces thereof, along with associated spatio-temporal metadata.

rinat: https://github.com/ropensci/rinat

mol.org

Map of Life endeavors to provide ‘best-possible’ species range information and species lists for any geographic area. The Map of Life assembles and integrates different sources of data describing species distributions worldwide.

iucnredlist.org

IUCN’s (International Union for Conservation of Nature) Red List of Threatened Species has evolved to become the world’s most comprehensive information source on the global extinction risk status of animal, fungus and plant species.

rredlist: https://github.com/ropensci/rredlist

redlistr: https://github.com/red-list-ecosystem/redlistr

bien.nceas.ucsb.edu/bien/

BIEN is a network of ecologists, botanists, and computer scientists working together to document global patterns of plant diversity, function and distribution.

rbien: https://github.com/bmaitner/RBIEN

10.1016/j.dib.2017.05.007

Chorological maps for the main European woody species is a data paper with a dataset of chorological maps for the main European tree and shrub species, put together by Giovanni Caudullo, Erik Welk, and Jesús San-Miguel-Ayanz.

sibbr.gov.br

SiBBr (Brazilian Biodiversity Information System) is an online platform that integrates data and information about biodiversity and ecosystems from different sources, making them accessible for different uses.

sibbr: https://github.com/sibbr

UK bto.org/our-science/projects/breeding-bird-survey

USA usgs.gov/centers/eesc/science/north-american-breeding-bird-survey

BBS (Breeding Bird Survey) involves thousands of volunteer birdwatchers carrying out standardised annual bird counts on randomly-located 1-km sites. It’s part of the NBN Atlas.

biotime.st-andrews.ac.uk

BioTime is an open-access global database of assemblage time series for quantifying and understanding biodiversity change.

BioTime Hub: https://github.com/bioTIMEHub

3. Open Data

Open means anyone can freely access, use, modify, and share for any purpose.


Public doesn’t mean open

Data on the internet can be public without being open. They can be standardised and available in open formats (e.g., csv), and yet, if they don’t have a licence, they are closed by default (all rights reserved).

3. Open Data: Data standards

Darwin Core is the internationally agreed data standard to facilitate the sharing of information about biological diversity.

dwc.tdwg.org

countryCode: The standard code for the country in which the Location occurs. Recommended best practice is to use an ISO 3166-1-alpha-2 country code. 

recordedBy: A list (concatenated and separated) of names of people, groups, or organizations responsible for recording the original Occurrence.
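
As a small illustration of why a shared standard helps, the sketch below checks these two Darwin Core terms on an invented table in base R. The records and names are hypothetical. The check on countryCode only tests the format (two uppercase letters), not whether the code really exists in ISO 3166-1, and the " | " separator for recordedBy is assumed here because it is the one Darwin Core recommends for lists.

```r
# Invented occurrence records using Darwin Core term names as columns.
occ <- data.frame(
  occurrenceID = c("obs-1", "obs-2", "obs-3"),
  countryCode  = c("CZ", "cz", "Czechia"),
  recordedBy   = c("A. Novak | B. Svoboda", "C. Dvorak", NA),
  stringsAsFactors = FALSE
)

# Format check only: exactly two uppercase letters.
valid_code <- grepl("^[A-Z]{2}$", occ$countryCode)

# Split the concatenated list of recorders into individual names.
recorders <- strsplit(occ$recordedBy, " | ", fixed = TRUE)

# Number of recorders per record (0 when the field is empty).
n_recorders <- ifelse(is.na(occ$recordedBy), 0L, lengths(recorders))

data.frame(occ$occurrenceID, valid_code, n_recorders)
```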

3. Open Data: Licensing

Open data are licensed under open licenses. Some examples:


CC0: Public domain


CC-BY: Attribution


CC-BY-NC: Attribution - Non Commercial


CC-BY-SA: Attribution - Share Alike
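
In practice, the licence of each record usually arrives as a URL in a licence column. The sketch below, in base R, turns such URLs into the short labels listed above. The URLs are written in the form Creative Commons uses for its licence deeds; treat both the exact strings and the helper name label_licence() as assumptions made for this example. A record without a licence gets NA, which by the rule above means "closed".

```r
# Example licence values; the last record has no licence at all.
licence_urls <- c(
  "http://creativecommons.org/publicdomain/zero/1.0/legalcode",
  "http://creativecommons.org/licenses/by/4.0/legalcode",
  "http://creativecommons.org/licenses/by-nc/4.0/legalcode",
  NA
)

# Hypothetical helper: map a licence URL to a short label.
label_licence <- function(url) {
  ifelse(grepl("publicdomain/zero", url), "CC0",
  ifelse(grepl("licenses/by-nc", url), "CC-BY-NC",
  ifelse(grepl("licenses/by-sa", url), "CC-BY-SA",
  ifelse(grepl("licenses/by/", url), "CC-BY", NA_character_))))
}

labels <- label_licence(licence_urls)
table(labels, useNA = "ifany")
```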

3. Open Data: Data sharing

Data that are standardized and have an open licence can be shared :)

EXERCISE 1 Explore different data sources

Imagine you want to start a project:

Choose a taxon, choose one data source and try to get distribution data.

Then answer the following questions:

  • Which data types does the source provide?
  • Which taxa does the database generally cover?
  • How accessible are the data? Can anyone download them? Are there restrictions?
  • What was your experience? What issues did you encounter while getting the data?

EXERCISE 2 Mammals of the Czech Republic

We will use the mammals of the Czech Republic as an example dataset. We will access the data through GBIF using tools available in R.

Some preparation before starting to code

  • Create a new project for all your practical sessions (with code and data folders inside).
  • Comment your code as much as possible, as if you were to explain it to others (that other could be you in 3 months!).
  • Keep your code short and easily readable in plain English.

File > New project > New directory or Existing directory

1 Install and load libraries

We will always load packages into R using the package pacman.

install.packages('pacman')
library(pacman) # load
packageVersion('pacman')
[1] '0.5.1'


If you attempt to load a library that is not installed, pacman will try to install it automatically.
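
To see what this convenience saves us from, here is a rough base-R sketch that only reports which packages are not yet installed. It is not pacman's implementation; the helper name missing_packages() and the fake package name are invented for the example.

```r
# Hypothetical helper: return the names of packages that are not installed.
missing_packages <- function(pkgs) {
  # requireNamespace() returns FALSE (without an error) when a package is absent.
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  unname(pkgs[!installed])
}

# "stats" ships with R; the second name is deliberately fake.
to_install <- missing_packages(c("stats", "notARealPackage123"))
to_install
```

Without pacman we would then have to call install.packages() on the result and library() on each package ourselves.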

We will use tidyverse for the manipulation and transformation of data.

pacman::p_load(tidyverse) # Data wrangling
packageVersion('tidyverse')
[1] '2.0.0'


We will be using many functions from this collection of packages, like filter(), mutate(), and later read_csv().

We will use rgbif to download data from GBIF directly into R.

pacman::p_load(rgbif) # the GBIF R package
packageVersion('rgbif')
[1] '3.8.0'

We will need to get a taxon ID (taxonKey) for the Mammalia class from the GBIF backbone. For that we will use another package called taxize.

pacman::p_load(taxize)
packageVersion('taxize')
[1] '0.9.100'

We will use sf to work with spatial data.

pacman::p_load(sf)
packageVersion('sf')
[1] '1.0.16'

We will use rnaturalearth to interact with Natural Earth to get mapping data into R (e.g., countries’ polygons).

pacman::p_load(rnaturalearth)
packageVersion('rnaturalearth')
[1] '1.0.1'

2 Project variables

Create some variables that will be used later.

taxa <- "Mammalia"
country_code <- "CZ" # Two letters ISO code for Czechia
proj_crs <- 4326 # EPSG code for WGS84

Get a taxon ID for the Mammalia class.

taxon_key <- get_gbifid_(taxa) %>%
  bind_rows() %>% # Transform the result of get_gbifid into a dataframe
  filter(matchtype == "EXACT" & status == "ACCEPTED") %>% # Filter the dataframe by the columns "matchtype" and "status"
  pull(usagekey) # Pull the contents of the column "usagekey"

taxon_key
[1] 359
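
The pipeline above can be tried without an internet connection on a mock result. The sketch below assumes that the lookup returns a named list with one data frame per queried name, containing at least the columns usagekey, matchtype and status used above; the extra rows and their keys are invented. It also adds a guard, because a name can match more than one backbone entry.

```r
library(dplyr)

# Mock of the lookup result: a named list with one data frame per name.
# Only key 359 comes from the text above; the other rows are invented.
mock_matches <- list(
  Mammalia = data.frame(
    usagekey  = c(359L, 999001L, 999002L),
    matchtype = c("EXACT", "EXACT", "FUZZY"),
    status    = c("ACCEPTED", "DOUBTFUL", "ACCEPTED")
  )
)

taxon_key <- mock_matches %>%
  bind_rows() %>%                                     # list of data frames -> one data frame
  filter(matchtype == "EXACT", status == "ACCEPTED") %>%
  pull(usagekey)

# Guard: stop early if the filter did not leave exactly one key.
stopifnot(length(taxon_key) == 1)
taxon_key
```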

Basemap of CZ to use later for plotting or checking the dataset.

base_map <- rnaturalearth::ne_countries(
  scale = 110,
  type = 'countries',
  country = 'czechia',
  returnclass = 'sf'
)
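
As a preview of how a basemap can be used to check a dataset, the sketch below tests whether points fall inside a polygon with sf. To stay self-contained it does not use the real basemap: the rectangle is only a rough bounding box around the country, and the three points are invented, so both are assumptions for illustration.

```r
library(sf)

# Invented records: two inside the rough box, one far outside.
toy_points <- data.frame(
  id = 1:3,
  decimalLongitude = c(14.4, 16.3, 2.35),
  decimalLatitude  = c(50.1, 50.0, 48.86)
)

# Turn the table into spatial points in WGS84 (EPSG:4326).
pts <- st_as_sf(
  toy_points,
  coords = c("decimalLongitude", "decimalLatitude"),
  crs = 4326
)

# Rough rectangle standing in for the country polygon (assumption).
box <- st_as_sfc(
  st_bbox(c(xmin = 12, ymin = 48.5, xmax = 19, ymax = 51.1), crs = st_crs(4326))
)

# TRUE for points that intersect the polygon.
inside <- lengths(st_intersects(pts, box)) > 0
inside
```

With the real data, box would be replaced by base_map, as long as both objects share the same coordinate reference system.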

3 GBIF data download

Now we can use the function occ_count() to count occurrence records in GBIF. Its arguments are listed below.

occ_count(
  taxonKey = NULL,
  georeferenced = NULL,
  basisOfRecord = NULL,
  datasetKey = NULL,
  date = NULL,
  typeStatus = NULL,
  country = NULL,
  year = NULL,
  from = 2000,
  to = 2012,
  type = "count",
  publishingCountry = "US",
  protocol = NULL,
  curlopts = list()
)

How many occurrence records are in GBIF for the entire Czech Republic?

occ_count(country=country_code) # country code for Czech Republic (https://countrycode.org/)
[1] 4404250


And how many records for the mammals of the Czech Republic?

occ_count(
  country = country_code,
  taxonKey = taxon_key
)
[1] 8097


We are ready to do a download. Whoop!

3.1 CZ mammals’ GBIF data download

To do this, we will use occ_search().

occ_search(
  taxonKey = NULL,
  scientificName = NULL,
  country = NULL,
  publishingCountry = NULL,
  hasCoordinate = NULL,
  typeStatus = NULL,
  recordNumber = NULL,
  lastInterpreted = NULL,
  continent = NULL,
  geometry = NULL,
  geom_big = "asis",
  geom_size = 40,
  geom_n = 10,
  recordedBy = NULL,
  recordedByID = NULL,
  identifiedByID = NULL,
  basisOfRecord = NULL,
  datasetKey = NULL,
  eventDate = NULL,
  catalogNumber = NULL,
  year = NULL,
  month = NULL,
  decimalLatitude = NULL,
  decimalLongitude = NULL,
  elevation = NULL,
  depth = NULL,
  institutionCode = NULL,
  collectionCode = NULL,
  hasGeospatialIssue = NULL,
  issue = NULL,
  search = NULL,
  mediaType = NULL,
  subgenusKey = NULL,
  repatriated = NULL,
  phylumKey = NULL,
  kingdomKey = NULL,
  classKey = NULL,
  orderKey = NULL,
  familyKey = NULL,
  genusKey = NULL,
  establishmentMeans = NULL,
  protocol = NULL,
  license = NULL,
  organismId = NULL,
  publishingOrg = NULL,
  stateProvince = NULL,
  waterBody = NULL,
  locality = NULL,
  limit = 500,
  start = 0,
  fields = "all",
  return = NULL,
  facet = NULL,
  facetMincount = NULL,
  facetMultiselect = NULL,
  skip_validate = TRUE,
  curlopts = list(),
  ...
)

Get occurrence records of mammals from Czech Republic.

occ_search(taxonKey=taxon_key,
           country='CZ') 
Records found [8097] 
Records returned [500] 
No. unique hierarchies [36] 
No. media records [500] 
No. facets [0] 
Args [occurrenceStatus=PRESENT, limit=500, offset=0, taxonKey=359, country=CZ,
     fields=all] 
# A tibble: 500 × 100
   key        scientificName  decimalLatitude decimalLongitude issues datasetKey
   <chr>      <chr>                     <dbl>            <dbl> <chr>  <chr>     
 1 4518978086 Myocastor coyp…            50.1             14.4 cdc,c… 50c9509d-…
 2 4510103035 Sciurus vulgar…            49.8             13.4 cdc,c… 50c9509d-…
 3 4510305990 Myocastor coyp…            50.1             14.4 cdc,c… 50c9509d-…
 4 4510153353 Myocastor coyp…            50.1             14.4 cdc,c… 50c9509d-…
 5 4510362535 Microtus arval…            50.0             16.3 cdc,c… 50c9509d-…
 6 4510154668 Castor fiber L…            49.9             14.2 cdc,c… 50c9509d-…
 7 4510377266 Capreolus capr…            49.2             17.4 cdc,c… 50c9509d-…
 8 4510457308 Myocastor coyp…            50.1             14.4 cdc,c… 50c9509d-…
 9 4510279317 Capreolus capr…            49.4             15.7 cdc,c… 50c9509d-…
10 4512107228 Castor fiber L…            49.5             13.3 cdc,c… 50c9509d-…
# ℹ 490 more rows
# ℹ 94 more variables: publishingOrgKey <chr>, installationKey <chr>,
#   hostingOrganizationKey <chr>, publishingCountry <chr>, protocol <chr>,
#   lastCrawled <chr>, lastParsed <chr>, crawlId <int>, basisOfRecord <chr>,
#   occurrenceStatus <chr>, taxonKey <int>, kingdomKey <int>, phylumKey <int>,
#   classKey <int>, orderKey <int>, familyKey <int>, genusKey <int>,
#   speciesKey <int>, acceptedTaxonKey <int>, acceptedScientificName <chr>, …

By default, occ_search() returns only the first 500 records.

To get all the records we need to specify a larger limit. Since we have over 8,000 records, we’ll choose 9,000 as the limit.

occ_search(taxonKey = taxon_key,
           country = 'CZ',
           limit = 9000)
Records found [8097] 
Records returned [8097] 
No. unique hierarchies [275] 
No. media records [8097] 
No. facets [0] 
Args [occurrenceStatus=PRESENT, limit=9000, offset=0, taxonKey=359, country=CZ,
     fields=all] 
# A tibble: 8,097 × 190
   key        scientificName  decimalLatitude decimalLongitude issues datasetKey
   <chr>      <chr>                     <dbl>            <dbl> <chr>  <chr>     
 1 4518978086 Myocastor coyp…            50.1             14.4 cdc,c… 50c9509d-…
 2 4510103035 Sciurus vulgar…            49.8             13.4 cdc,c… 50c9509d-…
 3 4510305990 Myocastor coyp…            50.1             14.4 cdc,c… 50c9509d-…
 4 4510153353 Myocastor coyp…            50.1             14.4 cdc,c… 50c9509d-…
 5 4510362535 Microtus arval…            50.0             16.3 cdc,c… 50c9509d-…
 6 4510154668 Castor fiber L…            49.9             14.2 cdc,c… 50c9509d-…
 7 4510377266 Capreolus capr…            49.2             17.4 cdc,c… 50c9509d-…
 8 4510457308 Myocastor coyp…            50.1             14.4 cdc,c… 50c9509d-…
 9 4510279317 Capreolus capr…            49.4             15.7 cdc,c… 50c9509d-…
10 4512107228 Castor fiber L…            49.5             13.3 cdc,c… 50c9509d-…
# ℹ 8,087 more rows
# ℹ 184 more variables: publishingOrgKey <chr>, installationKey <chr>,
#   hostingOrganizationKey <chr>, publishingCountry <chr>, protocol <chr>,
#   lastCrawled <chr>, lastParsed <chr>, crawlId <int>, basisOfRecord <chr>,
#   occurrenceStatus <chr>, taxonKey <int>, kingdomKey <int>, phylumKey <int>,
#   classKey <int>, orderKey <int>, familyKey <int>, genusKey <int>,
#   speciesKey <int>, acceptedTaxonKey <int>, acceptedScientificName <chr>, …
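
Instead of typing the limit by hand, it could be derived from the count obtained earlier. The helper below is a hypothetical sketch (choose_limit() is not an rgbif function). It assumes that occ_search() cannot return more than 100,000 records per query, beyond which GBIF's asynchronous download service is the usual route; please check the rgbif documentation before relying on that number.

```r
# Hypothetical helper: pick a search limit from the number of records found.
choose_limit <- function(n_found, hard_cap = 100000) {
  if (n_found > hard_cap) {
    stop("Too many records for a search; request a GBIF download instead.")
  }
  n_found
}

# With the count reported above for the mammals of the Czech Republic:
search_limit <- choose_limit(8097)
search_limit
```

In the real workflow, the argument would be the value returned by occ_count(country = country_code, taxonKey = taxon_key).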

Finally, we store the result in the object mammalsCZ.

mammalsCZ <- occ_search(
  taxonKey = taxon_key, # Key 359 created previously
  country = country_code, # CZ, ISO code of Czechia
  limit = 9000, # Max number of records to download
  hasGeospatialIssue = FALSE # Only records without spatial issues
)

# The output of occ_search() is a list with a data object inside.
# Here we pull the data out of the list.
mammalsCZ <- mammalsCZ$data
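
Since GBIF changes over time, it can be worth saving the downloaded table so that the rest of the analysis does not depend on repeating the query. The sketch below uses base R and a temporary folder to stay self-contained; the toy table and the file name are invented, and in the project you would more likely point the path at your data folder.

```r
# Toy stand-in for the downloaded table.
toy_download <- data.frame(
  key     = c("1", "2"),
  species = c("Castor fiber", "Sciurus vulgaris")
)

# Save the object to disk (temporary folder used here for the example).
cache_file <- file.path(tempdir(), "mammalsCZ_raw.rds")
saveRDS(toy_download, cache_file)

# Later, or in another session: read it back instead of downloading again.
restored <- readRDS(cache_file)
identical(toy_download, restored)
```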

4 Data exploration

Mammal occurrence records from the Czech Republic

glimpse(mammalsCZ)
Rows: 8,045
Columns: 189
$ key                                <chr> "4518978086", "4510103035", "451030…
$ scientificName                     <chr> "Myocastor coypus (Molina, 1782)", …
$ decimalLatitude                    <dbl> 50.08180, 49.75979, 50.08204, 50.08…
$ decimalLongitude                   <dbl> 14.41210, 13.35779, 14.41030, 14.40…
$ issues                             <chr> "cdc,cdround", "cdc,cdround", "cdc,…
$ datasetKey                         <chr> "50c9509d-22c7-4a22-a47d-8c48425ef4…
$ publishingOrgKey                   <chr> "28eb1a3f-1c15-4a95-931a-4af90ecb57…
$ installationKey                    <chr> "997448a8-f762-11e1-a439-00145eb45e…
$ hostingOrganizationKey             <chr> "28eb1a3f-1c15-4a95-931a-4af90ecb57…
$ publishingCountry                  <chr> "US", "US", "US", "US", "US", "US",…
$ protocol                           <chr> "DWC_ARCHIVE", "DWC_ARCHIVE", "DWC_…
$ lastCrawled                        <chr> "2024-09-20T15:47:23.061+00:00", "2…
$ lastParsed                         <chr> "2024-09-21T09:35:02.230+00:00", "2…
$ crawlId                            <int> 486, 486, 486, 486, 486, 486, 486, …
$ basisOfRecord                      <chr> "HUMAN_OBSERVATION", "HUMAN_OBSERVA…
$ occurrenceStatus                   <chr> "PRESENT", "PRESENT", "PRESENT", "P…
$ taxonKey                           <int> 4264680, 8211070, 4264680, 4264680,…
$ kingdomKey                         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ phylumKey                          <int> 44, 44, 44, 44, 44, 44, 44, 44, 44,…
$ classKey                           <int> 359, 359, 359, 359, 359, 359, 359, …
$ orderKey                           <int> 1459, 1459, 1459, 1459, 1459, 1459,…
$ familyKey                          <int> 3240572, 9456, 3240572, 3240572, 32…
$ genusKey                           <int> 3240573, 2437489, 3240573, 3240573,…
$ speciesKey                         <int> 4264680, 8211070, 4264680, 4264680,…
$ acceptedTaxonKey                   <int> 4264680, 8211070, 4264680, 4264680,…
$ acceptedScientificName             <chr> "Myocastor coypus (Molina, 1782)", …
$ kingdom                            <chr> "Animalia", "Animalia", "Animalia",…
$ phylum                             <chr> "Chordata", "Chordata", "Chordata",…
$ order                              <chr> "Rodentia", "Rodentia", "Rodentia",…
$ family                             <chr> "Myocastoridae", "Sciuridae", "Myoc…
$ genus                              <chr> "Myocastor", "Sciurus", "Myocastor"…
$ species                            <chr> "Myocastor coypus", "Sciurus vulgar…
$ genericName                        <chr> "Myocastor", "Sciurus", "Myocastor"…
$ specificEpithet                    <chr> "coypus", "vulgaris", "coypus", "co…
$ taxonRank                          <chr> "SPECIES", "SPECIES", "SPECIES", "S…
$ taxonomicStatus                    <chr> "ACCEPTED", "ACCEPTED", "ACCEPTED",…
$ iucnRedListCategory                <chr> "LC", "LC", "LC", "LC", "LC", "LC",…
$ dateIdentified                     <chr> "2024-01-12T10:43:43", "2024-01-04T…
$ coordinateUncertaintyInMeters      <dbl> 102, 2, 9, 3, NA, 3, 15, 4, 19, 3, …
$ continent                          <chr> "EUROPE", "EUROPE", "EUROPE", "EURO…
$ stateProvince                      <chr> "Prague", "Plzeňský", "Prague", "Pr…
$ year                               <int> 2024, 2024, 2024, 2024, 2024, 2024,…
$ month                              <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ day                                <int> 1, 2, 4, 5, 6, 6, 7, 8, 8, 10, 11, …
$ eventDate                          <chr> "2024-01-01T15:09:20", "2024-01-02T…
$ startDayOfYear                     <int> 1, 2, 4, 5, 6, 6, 7, 8, 8, 10, 11, …
$ endDayOfYear                       <int> 1, 2, 4, 5, 6, 6, 7, 8, 8, 10, 11, …
$ modified                           <chr> "2024-03-21T21:33:22.000+00:00", "2…
$ lastInterpreted                    <chr> "2024-09-21T09:35:02.230+00:00", "2…
$ references                         <chr> "https://www.inaturalist.org/observ…
$ license                            <chr> "http://creativecommons.org/license…
$ isSequenced                        <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, …
$ identifier                         <chr> "195466906", "195735620", "19574596…
$ facts                              <chr> "none", "none", "none", "none", "no…
$ relations                          <chr> "none", "none", "none", "none", "no…
$ isInCluster                        <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, F…
$ datasetName                        <chr> "iNaturalist research-grade observa…
$ recordedBy                         <chr> "katjawil", "Míša Peterka", "Andrej…
$ identifiedBy                       <chr> "manumea2000", "Míša Peterka", "And…
$ geodeticDatum                      <chr> "WGS84", "WGS84", "WGS84", "WGS84",…
$ class                              <chr> "Mammalia", "Mammalia", "Mammalia",…
$ countryCode                        <chr> "CZ", "CZ", "CZ", "CZ", "CZ", "CZ",…
$ recordedByIDs                      <chr> "none", "none", "none", "none", "no…
$ identifiedByIDs                    <chr> "none", "none", "none", "none", "no…
$ gbifRegion                         <chr> "EUROPE", "EUROPE", "EUROPE", "EURO…
$ country                            <chr> "Czechia", "Czechia", "Czechia", "C…
$ publishedByGbifRegion              <chr> "NORTH_AMERICA", "NORTH_AMERICA", "…
$ rightsHolder                       <chr> "katjawil", "Míša Peterka", "Andrej…
$ identifier.1                       <chr> "195466906", "195735620", "19574596…
$ http...unknown.org.nick            <chr> "katjawil", "peterkam", "andrej_fun…
$ verbatimEventDate                  <chr> "2024-01-01 15:09:20+01:00", "2024/…
$ collectionCode                     <chr> "Observations", "Observations", "Ob…
$ verbatimLocality                   <chr> "Vltava, Prague 1, Prag, CZ", "Plze…
$ gbifID                             <chr> "4518978086", "4510103035", "451030…
$ occurrenceID                       <chr> "https://www.inaturalist.org/observ…
$ taxonID                            <chr> "43997", "46001", "43997", "43997",…
$ catalogNumber                      <chr> "195466906", "195735620", "19574596…
$ institutionCode                    <chr> "iNaturalist", "iNaturalist", "iNat…
$ eventTime                          <chr> "15:09:20+01:00", "13:09:00+01:00",…
$ http...unknown.org.captive         <chr> "wild", "wild", "wild", "wild", "wi…
$ identificationID                   <chr> "442516619", "440401646", "44043095…
$ name                               <chr> "Myocastor coypus (Molina, 1782)", …
$ recordedByIDs.type                 <chr> NA, NA, NA, NA, NA, NA, NA, "ORCID"…
$ recordedByIDs.value                <chr> NA, NA, NA, NA, NA, NA, NA, "https:…
$ identifiedByIDs.type               <chr> NA, NA, NA, NA, NA, NA, NA, "ORCID"…
$ identifiedByIDs.value              <chr> NA, NA, NA, NA, NA, NA, NA, "https:…
$ informationWithheld                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ identificationRemarks              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ occurrenceRemarks                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ lifeStage                          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ sex                                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ individualCount                    <int> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ samplingProtocol                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ habitat                            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ vernacularName                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ locality                           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ identificationVerificationStatus   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ eventType                          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ infraspecificEpithet               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ distanceFromCentroidInMeters       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ dataGeneralizations                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ datasetID                          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ language                           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ accessRights                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ recordNumber                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ http...unknown.org.taxonRankID     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ dynamicProperties                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ taxonConceptID                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ taxonRemarks                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ eventID                            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ projectId                          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ organismQuantity                   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ organismQuantityType               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ otherCatalogNumbers                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ gadm                               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ associatedSequences                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ networkKeys                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ coordinatePrecision                <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ institutionKey                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ acceptedNameUsage                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ locationRemarks                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ georeferencedBy                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ collectionKey                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ preparations                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ institutionID                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ nomenclaturalCode                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ type                               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ disposition                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ bibliographicCitation              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ collectionID                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ http...unknown.org.language        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ footprintWKT                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ http...unknown.org.modified        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ originalNameUsage                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ nameAccordingTo                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ elevation                          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ elevationAccuracy                  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fieldNumber                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ higherGeography                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ locationAccordingTo                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ georeferencedDate                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ georeferenceProtocol               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ verbatimCoordinateSystem           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ organismID                         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ previousIdentifications            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ identificationQualifier            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ higherClassification               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ georeferenceSources                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ ownerInstitutionCode               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ materialEntityID                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ footprintSRS                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ verbatimIdentification             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ locationID                         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ georeferenceRemarks                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ http...unknown.org.recordID        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ county                             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ rights                             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ http...unknown.org.recordEnteredBy <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ georeferenceVerificationStatus     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ establishmentMeans                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ parentNameUsage                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ island                             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ materialSampleID                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ associatedReferences               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ eventRemarks                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ verbatimElevation                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ higherGeographyID                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ combinationAuthors                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ verbatimScientificName             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ http...unknown.org.verbatimLabel   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ combinationYear                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ canonicalName                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ earliestEonOrLowestEonothem        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ latestEonOrHighestEonothem         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ earliestEraOrLowestErathem         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ latestEraOrHighestErathem          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ earliestPeriodOrLowestSystem       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ latestPeriodOrHighestSystem        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ earliestEpochOrLowestSeries        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ latestEpochOrHighestSeries         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ municipality                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ earliestAgeOrLowestStage           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ namePublishedInYear                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ lithostratigraphicTerms            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ verbatimTaxonRank                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ latestAgeOrHighestStage            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ formation                          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ bed                                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ geologicalContextID                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…

Check the data output. How many rows and columns does it have?

How many records do we have?

nrow(mammalsCZ)
[1] 8045


How many species do we have?

mammalsCZ %>%
  filter(taxonRank == "SPECIES") %>%
  distinct(scientificName) %>%
  nrow()
[1] 180

distinct() is used to see unique values
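
The same count can also be obtained in one step with n_distinct(). A minimal sketch on made-up records (toy and its values are hypothetical), comparing both routes:

```r
library(dplyr)

# Made-up records: two species (one repeated) and one genus-level record
toy <- tibble(
  scientificName = c("Lutra lutra (Linnaeus, 1758)",
                     "Lutra lutra (Linnaeus, 1758)",
                     "Lynx lynx (Linnaeus, 1758)",
                     "Myotis Kaup, 1829"),
  taxonRank      = c("SPECIES", "SPECIES", "SPECIES", "GENUS")
)

# distinct() keeps one row per unique name; nrow() then counts those rows
n_via_distinct <- toy %>%
  filter(taxonRank == "SPECIES") %>%
  distinct(scientificName) %>%
  nrow()

# n_distinct() counts the unique values directly
n_via_n_distinct <- toy %>%
  filter(taxonRank == "SPECIES") %>%
  summarise(n_species = n_distinct(scientificName)) %>%
  pull(n_species)

n_via_distinct
n_via_n_distinct
```

Both routes count unique name strings, so two spellings of the same species would be counted twice.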

5 Data quality

Data are not inherently ‘good’ or ‘bad’; their quality depends on our goal.
Some things we can check:

  • Basis of the record (type of occurrence)
  • Species names (taxonomic harmonisation)
  • Spatial and temporal information (accuracy / precision)

CoordinateCleaner: https://github.com/ropensci/CoordinateCleaner

Automated flagging of common spatial and temporal errors in data.
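
To give a feel for what "flagging" means, here is a hand-written illustration only. It is not CoordinateCleaner's API or implementation, and the toy table and flag names are hypothetical. It assumes the coordinates are stored in the Darwin Core fields decimalLongitude and decimalLatitude:

```r
library(dplyr)

# Made-up records with typical coordinate problems (not real data)
toy <- tibble(
  scientificName   = c("Lutra lutra", "Lynx lynx", "Sciurus vulgaris", "Myotis myotis"),
  decimalLongitude = c(14.42, 0, 50.08, 200),
  decimalLatitude  = c(50.09, 0, 50.08, 49.5)
)

flagged <- toy %>%
  mutate(
    # both coordinates exactly zero: often a placeholder for "unknown"
    zero_coords  = decimalLongitude == 0 & decimalLatitude == 0,
    # identical longitude and latitude: often a data-entry error
    equal_coords = decimalLongitude == decimalLatitude,
    # values outside the valid range for geographic coordinates
    out_of_range = abs(decimalLongitude) > 180 | abs(decimalLatitude) > 90,
    any_flag     = zero_coords | equal_coords | out_of_range
  )

# Inspect the suspicious records before deciding whether to remove them
flagged %>% filter(any_flag)
```

Flagging first and removing later keeps the decision visible: a flagged record is a candidate for inspection, not automatically an error.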

5.1 Basic data filtering

As an example of data cleaning procedures, we will check the following fields in our dataset:

  • basisOfRecord: we want preserved specimens or observations
  • taxonRank: we want records at species level.
  • coordinateUncertaintyInMeters: we want it to be smaller than 10 km

5.1 Basic data filtering

  • basisOfRecord: we want preserved specimens or observations
mammalsCZ %>% distinct(basisOfRecord)
# A tibble: 7 × 1
  basisOfRecord     
  <chr>             
1 HUMAN_OBSERVATION 
2 OBSERVATION       
3 MATERIAL_SAMPLE   
4 PRESERVED_SPECIMEN
5 FOSSIL_SPECIMEN   
6 OCCURRENCE        
7 MATERIAL_CITATION 

distinct() is used to see unique values

5.1 Basic data filtering

  • basisOfRecord: we want preserved specimens or observations
mammalsCZ %>%
  group_by(basisOfRecord) %>%
  count()
# A tibble: 7 × 2
# Groups:   basisOfRecord [7]
  basisOfRecord          n
  <chr>              <int>
1 FOSSIL_SPECIMEN      200
2 HUMAN_OBSERVATION   6524
3 MATERIAL_CITATION    206
4 MATERIAL_SAMPLE      105
5 OBSERVATION           77
6 OCCURRENCE            11
7 PRESERVED_SPECIMEN   922

group_by() groups the rows by the values of a variable; count() then counts the rows in each group
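
count() can also take the grouping variable directly, which saves the group_by() step. A minimal sketch on made-up values (toy is hypothetical):

```r
library(dplyr)

# Made-up records (not real data)
toy <- tibble(
  basisOfRecord = c("HUMAN_OBSERVATION", "HUMAN_OBSERVATION", "HUMAN_OBSERVATION",
                    "PRESERVED_SPECIMEN", "FOSSIL_SPECIMEN")
)

# Long form: group first, then count the rows in each group
by_group <- toy %>%
  group_by(basisOfRecord) %>%
  count()

# Short form: count() groups internally; sort = TRUE lists the largest group first
by_count <- toy %>%
  count(basisOfRecord, sort = TRUE)

by_group
by_count
```

The short form returns an ungrouped table, which is usually more convenient for further steps.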

5.1 Basic data filtering

  • basisOfRecord: we want preserved specimens or observations
mammalsCZ <- mammalsCZ %>%
  filter(basisOfRecord == "PRESERVED_SPECIMEN" |
    basisOfRecord == "HUMAN_OBSERVATION")

Note the use of | (OR) to combine the two conditions. An alternative is filter(basisOfRecord %in% c("PRESERVED_SPECIMEN","HUMAN_OBSERVATION")).
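
To check that the two forms select the same rows, here is a minimal sketch on made-up records (toy, id and keep are hypothetical names):

```r
library(dplyr)

# Made-up records (not real data)
toy <- tibble(
  id            = 1:5,
  basisOfRecord = c("HUMAN_OBSERVATION", "FOSSIL_SPECIMEN", "PRESERVED_SPECIMEN",
                    "OBSERVATION", "HUMAN_OBSERVATION")
)

# Values we want to keep
keep <- c("PRESERVED_SPECIMEN", "HUMAN_OBSERVATION")

# Form 1: conditions joined with | (OR)
with_or <- toy %>%
  filter(basisOfRecord == "PRESERVED_SPECIMEN" |
           basisOfRecord == "HUMAN_OBSERVATION")

# Form 2: membership test with %in%
with_in <- toy %>%
  filter(basisOfRecord %in% keep)

# Same rows selected by both forms?
identical(with_or$id, with_in$id)
```

The %in% form scales better when the list of accepted values grows. Both forms match whole values only, so a record labelled OBSERVATION is not kept by a filter on HUMAN_OBSERVATION.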


How many records do we have now?

nrow(mammalsCZ)
[1] 7446

5.1 Basic data filtering

  • taxonRank: we want records at species level
mammalsCZ %>% distinct(taxonRank)
# A tibble: 5 × 1
  taxonRank 
  <chr>     
1 SPECIES   
2 SUBSPECIES
3 GENUS     
4 ORDER     
5 FAMILY    

5.1 Basic data filtering

  • taxonRank: we want records at species level
mammalsCZ <- mammalsCZ %>% 
  filter(taxonRank == 'SPECIES')


How many records do we have now?

nrow(mammalsCZ)
[1] 7073

5.1 Basic data filtering

  • coordinateUncertaintyInMeters: we want it to be smaller than 10 km
mammalsCZ %>%
  filter(coordinateUncertaintyInMeters >= 10000) %>%
  select(scientificName, 
         coordinateUncertaintyInMeters, 
         stateProvince)
# A tibble: 309 × 3
   scientificName                           coordinateUncertaint…¹ stateProvince
   <chr>                                                     <dbl> <chr>        
 1 Bison bonasus (Linnaeus, 1758)                            26389 Středočeský  
 2 Bison bonasus (Linnaeus, 1758)                            26389 Středočeský  
 3 Lutra lutra (Linnaeus, 1758)                              26614 Jihomoravský 
 4 Procyon lotor (Linnaeus, 1758)                            26582 Jihočeský    
 5 Sciurus vulgaris Linnaeus, 1758                           22379 Prague       
 6 Clethrionomys glareolus (Schreber, 1780)                  26550 Jihočeský    
 7 Rhinolophus hipposideros (Bechstein, 18…                  26454 Moravskoslez…
 8 Lutra lutra (Linnaeus, 1758)                              26614 Niederösterr…
 9 Rhinolophus hipposideros (Bechstein, 18…                  26454 Moravskoslez…
10 Myotis myotis (Borkhausen, 1797)                          26454 Moravskoslez…
# ℹ 299 more rows
# ℹ abbreviated name: ¹​coordinateUncertaintyInMeters

5.1 Basic data filtering

  • coordinateUncertaintyInMeters: we want it to be smaller than 10 km
mammalsCZ <- mammalsCZ %>%
  filter(coordinateUncertaintyInMeters < 10000) # keep records below 10 km; rows with NA uncertainty are dropped too


How many records do we have now?

nrow(mammalsCZ)
[1] 5110
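
Note that 7073 - 309 = 6764, yet only 5110 records remain. The other 1654 records were presumably removed because their coordinateUncertaintyInMeters is NA: a comparison with NA returns NA, and filter() keeps only rows where the condition is TRUE. A minimal sketch on made-up values (toy is hypothetical) showing this behaviour and one way to keep such records if that suits the goal:

```r
library(dplyr)

# Made-up records; one has no reported uncertainty (not real data)
toy <- tibble(
  scientificName                = c("Lutra lutra", "Lynx lynx",
                                    "Sciurus vulgaris", "Myotis myotis"),
  coordinateUncertaintyInMeters = c(50, 26389, NA, 1000)
)

# Strict: the NA row is dropped together with the imprecise record
strict <- toy %>%
  filter(coordinateUncertaintyInMeters < 10000)

# Lenient: also keep records whose uncertainty was not reported
lenient <- toy %>%
  filter(is.na(coordinateUncertaintyInMeters) |
           coordinateUncertaintyInMeters < 10000)

# How many records have no reported uncertainty?
n_missing <- sum(is.na(toy$coordinateUncertaintyInMeters))

nrow(strict)
nrow(lenient)
n_missing
```

Neither choice is right in general: an unreported uncertainty may hide a precise or a very imprecise record, so the decision depends on the goal of the analysis.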

6 Basic maps

How are the records distributed?

We’ll get to this next week :)

6 Basic maps

And finally, a simple trick to produce separate maps per order.
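
The code behind this slide is not reproduced here. One common way to get one map panel per order, offered only as an assumption about what such a trick could look like, is facet_wrap() from ggplot2. The sketch below uses made-up points and assumes the columns order, decimalLongitude and decimalLatitude:

```r
library(ggplot2)

# Made-up points roughly within the Czech Republic (not real records)
toy <- data.frame(
  order            = c("Carnivora", "Carnivora", "Rodentia", "Rodentia", "Chiroptera"),
  decimalLongitude = c(14.4, 16.6, 13.4, 15.8, 17.3),
  decimalLatitude  = c(50.1, 49.2, 49.7, 50.2, 49.6)
)

p <- ggplot(toy, aes(x = decimalLongitude, y = decimalLatitude)) +
  geom_point() +
  coord_quickmap() +      # approximate map aspect ratio without extra packages
  facet_wrap(~order) +    # one panel per taxonomic order
  labs(x = "Longitude", y = "Latitude")

p
```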

In summary

  1. Identify a data type and source
  2. Check data-sharing agreements and licences
  3. Download data and associated metadata
  4. Check data quality (e.g., dates, spatial info, taxonomy)
  5. Clean data for purpose
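
The three filters applied in section 5.1 can also be chained into a single pipeline. A minimal sketch on made-up records (toy and toy_clean are hypothetical; the thresholds are the ones used in this practical):

```r
library(dplyr)

# Made-up records covering each reason for removal (not real data)
toy <- tibble(
  scientificName = c("Lutra lutra", "Lynx lynx", "Myotis",
                     "Bison bonasus", "Ursus spelaeus", "Sciurus vulgaris"),
  basisOfRecord  = c("HUMAN_OBSERVATION", "PRESERVED_SPECIMEN", "HUMAN_OBSERVATION",
                     "HUMAN_OBSERVATION", "FOSSIL_SPECIMEN", "HUMAN_OBSERVATION"),
  taxonRank      = c("SPECIES", "SPECIES", "GENUS",
                     "SPECIES", "SPECIES", "SPECIES"),
  coordinateUncertaintyInMeters = c(50, 1000, 200, 26389, 10, NA)
)

toy_clean <- toy %>%
  # keep preserved specimens and human observations
  filter(basisOfRecord %in% c("PRESERVED_SPECIMEN", "HUMAN_OBSERVATION")) %>%
  # keep records identified to species level
  filter(taxonRank == "SPECIES") %>%
  # keep records with uncertainty below 10 km (rows with NA are dropped too)
  filter(coordinateUncertaintyInMeters < 10000)

nrow(toy)        # records before cleaning
nrow(toy_clean)  # records after cleaning
```

Keeping the steps separate, as in the slides, makes it easier to see how many records each filter removes; the single pipeline is handy once the criteria are settled.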

Any questions?
